GridPP: From Prototype to Production
Abstract
GridPP is a £33m, 6-year project funded by PPARC that aims to establish a Grid for UK Particle Physics in time for the turn-on of the CERN Large Hadron Collider (LHC) in 2007. Over the last three years, a prototype Grid has been developed and put into production with computational resources that have increased by a factor of 100. GridPP is now about halfway through its second phase; the move from prototype to production is well underway, though many challenges remain.

1. The LHC Computing Challenge

After more than a decade of work, the world's highest energy particle accelerator, the Large Hadron Collider (LHC), and its associated detectors will come online in 2007 at CERN in Geneva. With a design luminosity of 800,000,000 proton-proton interactions per second, the 100,000,000 electronic channels embedded in each of the four detectors will produce around 10 Petabytes of data per year. Buried in that landslide of data, perhaps at the level of 1 part in 10^13, physicists hope to find the rare signature of a Higgs particle. Discovering the nature of the Higgs sector (one or more physically observable particles) would allow the origins of mass in the universe to be established.

A close relationship has existed between particle physics and computing for the last quarter of a century. Driven by economic, political, and performance issues, particle physicists have moved from the gold standard of service and performance provided by mainframes, through smaller institution-based single machines, to modest-sized cluster-based solutions. The Grid, a global and heterogeneous aggregation of hardware clusters, is the latest step along this path. It strives to minimise computing cost through the use of commodity hardware; to provide scalability to a size beyond that of mainframes; and to deliver a quality of service sufficient for the task, primarily by relying on redundancy and fault tolerance to balance the intrinsic unreliability of individual components. The Grid model matches the globally diverse nature of the particle physics experiment collaborations, providing politically and financially acceptable solutions to an otherwise intractable computing problem.

The data from the LHC detectors are filtered in quasi-real time by dedicated online trigger hardware and software algorithms. The selected raw data will stream from the detectors to the Tier-0 computing centre at CERN, and individual trigger streams will also be channelled to specific members of a global network of a dozen Tier-1 centres. The raw data are reconstructed and calibrated in a CPU-intensive process before being catalogued and archived as Event Summary Datasets (ESD). The data are further refined and rarefied to produce Analysis Object Datasets (AOD) and Tagged samples. All these datasets may be used subsequently for data analysis, and metadata about the data must also be compiled and catalogued. The raw data are complemented by a comparable quantity of simulated data, generated predominantly at smaller regional Tier-2 sites and processed in a similar manner to the raw data in order to understand detector performance, calibration, backgrounds, and analysis techniques. The computing requirements are enormous: in 2008, the first full year of data taking, a CPU capacity of 140 million SPECint2000 (roughly 140,000 3 GHz processors), 60 PB of disk storage and 50 PB of mass storage will be needed globally.
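The conversion implicit in the bracketed processor count can be made explicit with a short calculation. The Python sketch below is purely illustrative; the per-processor rating of roughly 1,000 SPECint2000 for a ~3 GHz processor of the period is an assumption inferred from the two figures quoted above, not a number taken from the text.

    # Back-of-the-envelope check of the 2008 capacity figures quoted above.
    # Assumption (not stated in the text): one ~3 GHz processor of the period
    # is rated at roughly 1,000 SPECint2000.
    SPECINT2000_REQUIRED = 140e6      # global CPU requirement, SPECint2000
    SPECINT2000_PER_CPU = 1_000       # assumed rating of one ~3 GHz processor

    cpus_needed = SPECINT2000_REQUIRED / SPECINT2000_PER_CPU
    print(f"Equivalent 3 GHz processors: {cpus_needed:,.0f}")   # ~140,000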
The hierarchy of Tier centres represents an optimisation of the resources, mapped to the functionality and level of service required for different parts of this problem. On the one hand, this recognises that there are economies of scale to be gained in the management and operation of computing resources, particularly commodity hardware for which there is only basic vendor support; on the other hand, it acknowledges that not all parts of the problem need the same services or quality of service, and that substantial benefits in cost and scale can also be gained by embracing an architecture where institutes, regions, or even countries, can plug-and-play. This, then, is the optimisation afforded by the Grid approach.

2. Overview of GridPP

Since September 2001, GridPP has striven to develop and deploy a highly functional Grid across the UK as part of the LHC Computing Grid (LCG) [1]. Working with the European EDG and latterly EGEE projects [2], GridPP helped develop middleware adopted by LCG. This, together with contributions from the US-based Globus [3] and Condor [4] projects, has formed the LCG releases, which have been deployed throughout the UK on a Grid presently consisting of more than 4,000 CPUs and 0.65 PB of storage. The UK HEP Grid is anchored by the Tier-1 [5] centre at the Rutherford Appleton Laboratory (RAL) and four distributed Tier-2 [6] centres known as ScotGrid, NorthGrid, SouthGrid and the London Tier-2. The 16 UK sites form an integral part of the joint LHC/EGEE computing Grid, which comprises 40,000 CPUs and access to 10 PB of storage and stretches from the Far East to North America.

Figure 1: The Global EGEE/LCG Grid.

3. Performance Review

The current phase of GridPP moves the UK HEP Grid from a prototype to a production platform. Whilst progress can be monitored by milestones and metrics, success can ultimately only be established by the widespread and successful use of substantial resources by the community. Collecting information about Grid use is, in itself, a Grid challenge. GridPP sites form the majority of the EGEE UK and Ireland region (UKI), with RAL as the Regional Operations Centre (ROC). RAL also runs the Grid Operations Centre (GOC) [8], which maintains a database of information about all sites and provides a number of monitoring and accounting tools. At the most basic level, the Site Functional Test (SFT), a small test job that runs many times a day at each site, determines the availability of the main Grid functions. Similarly, the Grid Status Monitor (GStat) retrieves information published by each site about its status. Figure-3 shows the average CPU availability by region for April 2006, derived from sites passing or failing the SFT. Although this particular data set is not complete (an improved metric is being released in July), it can be seen that within Europe the UKI region (second entry from the left) made a significant contribution, with 90% of its total of just over 4,000 CPUs available on average.

Figure-3: Average CPU availability for April 2006. CPUs at a site are deemed available (green) when the site passes the SFT and unavailable (red) when it fails.
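To illustrate how such an availability figure can be derived, the following Python sketch computes an availability fraction from per-run SFT outcomes weighted by each site's published CPU count. The SftResult record and the site names are hypothetical; this is not the GOC schema or the actual metric code, only a minimal model of the pass/fail rule described in the Figure-3 caption above.

    # Illustrative sketch (hypothetical data model, not the GOC schema):
    # all of a site's CPUs count as available for an SFT run the site passed,
    # and as unavailable otherwise; the average is taken over all runs.
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class SftResult:
        site: str
        cpus: int        # CPUs published by the site
        passed: bool     # outcome of one SFT run

    def average_cpu_availability(results: List[SftResult]) -> float:
        """Fraction of CPU-intervals in which the owning site passed the SFT."""
        total = sum(r.cpus for r in results)
        available = sum(r.cpus for r in results if r.passed)
        return available / total if total else 0.0

    # Toy example: two sites, three SFT runs each.
    runs = [
        SftResult("RAL-Tier1", 1000, True),
        SftResult("RAL-Tier1", 1000, True),
        SftResult("RAL-Tier1", 1000, False),
        SftResult("ScotGrid-Glasgow", 250, True),
        SftResult("ScotGrid-Glasgow", 250, True),
        SftResult("ScotGrid-Glasgow", 250, True),
    ]
    print(f"Average availability: {average_cpu_availability(runs):.0%}")

Weighting by CPU count means that a single large site failing the SFT pulls the regional average down far more than a small one, which is the intended behaviour of a capacity-oriented metric.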
In addition to the GOC database containing Grid-wide information, statistics are also recorded at the RAL Tier-1 centre using Ganglia to monitor CPU load, memory usage and queue data from the batch system. Figure-4 shows the usage by Virtual Organisation (VO) for 2005. Full capacity corresponds roughly to the top of the graph, so the Tier-1 facility was running at around 90% of capacity for the latter half of the year, though about half of this was non-Grid use, predominantly by the B-factory. In order to understand the apparently low CPU utilisation in the first half of 2005, a detailed analysis of batch job efficiency was carried out, where efficiency is the ratio of CPU time to elapsed time. A highly CPU-intensive batch job can achieve 95-98% utilisation of a CPU, an I/O-intensive job is more likely to achieve 85-95%, and jobs waiting for busy resources can vary from 0-100% efficiency. As can be seen from Figure-5, the overall efficiency was rather low during the second quarter of 2005, until the applications and their data-access patterns were better understood. When CPU time is corrected by job efficiency (to give job elapsed time), it is apparent (Figure-6) that the farm ran with greater than 70% occupancy for most of the year, rising to 100% in December.

Figure-4: Tier-1 CPU use for 2005.

Figure-5: CPU efficiency (CPU time / wall time).

The efficiency has continued to improve in 2006, with many experiments maintaining efficiencies well over 90%. The 2006 efficiency by Virtual Organisation is shown in Figure-7 below ("DTEAM" refers to the development team, whose debugging work leads to low observed efficiency).

Figure-7: 2006 CPU efficiency by Virtual Organisation, by month.
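The two quantities discussed here, per-VO efficiency and farm occupancy, can be expressed compactly. The Python sketch below is a minimal illustration using invented job records; the tuple format, the slot count and the numbers are hypothetical and do not reflect the Tier-1 accounting system.

    # Illustrative sketch (hypothetical job records): per-VO efficiency as
    # CPU time over wall-clock time, and farm occupancy as wall-clock time
    # delivered over the slot-time available in the accounting period.
    from collections import defaultdict
    from typing import Dict, List, Tuple

    # (vo, cpu_seconds, wall_seconds) for each completed batch job
    Job = Tuple[str, float, float]

    def efficiency_by_vo(jobs: List[Job]) -> Dict[str, float]:
        cpu = defaultdict(float)
        wall = defaultdict(float)
        for vo, cpu_s, wall_s in jobs:
            cpu[vo] += cpu_s
            wall[vo] += wall_s
        return {vo: cpu[vo] / wall[vo] for vo in cpu if wall[vo] > 0}

    def farm_occupancy(jobs: List[Job], job_slots: int, period_seconds: float) -> float:
        """Fraction of available slot-time filled by jobs (elapsed, not CPU, time)."""
        delivered = sum(wall_s for _, _, wall_s in jobs)
        return delivered / (job_slots * period_seconds)

    jobs: List[Job] = [
        ("lhcb",  9.5e4, 1.0e5),   # CPU-bound: ~95% efficient
        ("atlas", 4.0e4, 8.0e4),   # I/O-bound or waiting on data: 50% efficient
    ]
    WEEK = 7 * 86_400               # accounting period, seconds
    print(efficiency_by_vo(jobs))
    print(f"occupancy: {farm_occupancy(jobs, job_slots=2, period_seconds=WEEK):.0%}")

Note that occupancy is computed from elapsed time rather than CPU time, which mirrors the correction described above: a farm can be fully occupied even when the jobs it runs are inefficient.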